Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching
نویسنده
چکیده
Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vocabularies that codify the names of diseases, procedures, billing categories, etc. Molecular biologists need codified medical records so that they can discover or validate relationships between experimental data and clinical data. This paper introduces a new approach to retrieving data records without prior coding. The approach achieves the same result as a search over pre-coded records. It retrieves all records that contain any terms that are synonymous with a user's query-term. A recently described fast algorithm (the doublet method) permits quick iterative searches over every synonym for any term from any nomenclature occurring in a dataset of any size. As a demonstration, a 105+ Megabyte corpus of Pubmed abstracts was searched for medical terms. Query terms were matched against either of two vocabularies and expanded as an array of equivalent search items. A single search term may have over one hundred nomenclature synonyms, all of which were searched against the full database. Iterative searches of a list of concept-equivalent terms involves many more operations than a single search over pre-annotated concept codes. Nonetheless, the doublet method achieved fast query response times (0.05 seconds using Snomed and 5 seconds using the Developmental Lineage Classification of Neoplasms, on a computer with a 2.89 GHz processor). Pre-annotated datasets lose their value when the chosen vocabulary is replaced by a different vocabulary or by a different version of the same vocabulary. The doublet method can employ any version of any vocabulary with no pre-annotation. In many instances, the enormous effort and expense associated with data annotation can be eliminated by on-the-fly doublet matching. The algorithm for nomenclature-based database searches using the doublet method is described. Perl scripts for implementing the algorithm and testing execution speed are provided as open source documents available from the Association for Pathology Informatics (www.pathologyinformatics.org/informatics_r.htm).
منابع مشابه
Doublet method for very fast autocoding
BACKGROUND Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computation...
متن کاملFuzzy C-Means Clustering for Biomedical Documents Using Ontology Based Indexing and Semantic Annotation
Search is the most obvious application of information retrieval. The variety of widely obtainable biomedical data is enormous and is expanding fast. This expansion makes the existing techniques are not enough to extract the most interesting patterns from the collection as per the user requirement. Recent researches are concentrating more on semantic based searching than the traditional term bas...
متن کاملSemantic Annotation and Summarization of Biomedical Literature
Semantic Annotation and Summarization of Biomedical Literature Lawrence Harold Reeve, Jr. Hyoil Han, Ph.D. Advancements in the biomedical community are largely documented and published in text format in scientific forums such as conferences and journals. To address the scalability of utilizing the large volume of text-based information generated by continuing advances in the biomedical field, t...
متن کاملDisease Ontology: a backbone for disease semantic integration
The Disease Ontology (DO) database (http://disease-ontology.org) represents a comprehensive knowledge base of 8043 inherited, developmental and acquired human diseases (DO version 3, revision 2510). The DO web browser has been designed for speed, efficiency and robustness through the use of a graph database. Full-text contextual searching functionality using Lucene allows the querying of name, ...
متن کاملDealing with Uncertainty in Lexical Annotation
We present ALA, a tool for the automatic lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) of structured and semi-structured data sources and the discovery of probabilistic lexical relationships in a data integration environment. ALA performs automatic lexical annotation through the use of probabilistic annotations, i.e. an annotation is associated to a probability value....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- In silico biology
دوره 5 3 شماره
صفحات -
تاریخ انتشار 2005